Designing efficient randstrobes for sequence similarity analyses

Abstract

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has led to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated their efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.


Introduction
In sequence analyses, k-mers play an important role in various algorithms and approaches. For example, k-mers can be used as seeds for sequence similarity search, where a seed shared between two sequences acts as an anchor to identify similar regions between, e.g. DNA, RNA, or protein sequences. When used as seeds, k-mers enable rapid identification of shared regions and are used in a large number of short and long-read mapping algorithms (Alser et al. 2021, Sahlin et al. 2023), and other approaches for querying large sequence datasets (Marchet et al. 2021, Fan et al. 2024).
Both a feature and a limitation of using k-mers as seeds is that sequences must be identical for the seed to match. In biological data, it is common that mutations in DNA occur in the form of substituted, deleted, and inserted nucleotides. In addition, common DNA and RNA sequencing techniques are noisy and introduce additional alterations of the nucleic acids. To provide anchors also in regions with high divergence, seeds need to be able to anchor over mutations. Alternatives to k-mers have therefore been explored extensively in the literature, such as spaced k-mers (Ma et al. 2002). See Sahlin et al. (2023) for an overview of several other seeding constructs used in read mapping.

Strobemers
Recently, we introduced strobemers, a novel class of seed constructs (Sahlin 2021a). Strobemers can produce seed matches across substitutions, insertions, and deletions, expanding on ideas from neighboring minimizer pairs (Chin and Khalak 2019, Sahlin and Medvedev 2021) and k-min-mers (Ekim et al. 2021) that link neighboring minimizers (Roberts et al. 2004) into a seed. Strobemers generalize this linking by considering downstream k-mers as potential candidates to link, offering various methods such as minstrobes, randstrobes, and hybridstrobes (Sahlin 2021a), with randstrobes being the most effective. Randstrobes have been used, e.g. for short-read mapping (Sahlin 2022), transcriptomic long-read normalization (Nip et al. 2023), and read classification (Xu et al. 2023). Our recent study also demonstrates that randstrobes provide accurate sequence similarity ranking using the Jaccard distance (Maier and Sahlin 2023). This study also revealed a strong correlation between strobemers' sensitivity and the pseudo-randomness of the seed construct, measured through entropy (Maier and Sahlin 2023). While additional strobemer variants have been introduced (Maier and Sahlin 2023), randstrobes remain the simplest and most widely used construct. Constructing randstrobes involves converting strings to integers using a hash function and selecting candidate k-mers for linking through a link function and comparator operator. Sampling biases (Fig. 1) in this process can affect sequence matching efficiency (Maier and Sahlin 2023). So far, the underlying operators to produce randstrobes have not been evaluated.

Our contribution
We design metrics suitable for detecting and measuring several types of bias in randstrobe construction methods (Fig. 1). Using the new evaluation metrics, we uncovered biases and limitations in previous randstrobe methods (Sahlin 2021a, 2022, Xu et al. 2023). We propose new methods to enhance the core operations (hashing, linking, and comparison), which improve seed uniqueness, sampling uniformity, and construction runtime. We also introduce a Binary Search Tree (BST)-based construction method that reduces time complexity, achieves comparable randomness, and is much faster for some parametrizations. This is valuable for time-critical bioinformatics applications.
Additionally, we identify that the link function and comparator in the short-read mapper strobealign (Sahlin 2022) underperform in seed uniqueness compared to other methods. We therefore modified strobealign with the aim of enhancing accuracy. Although the modification does not improve the overall accuracy, an approach that selects the best alignment score per read from the modified and default versions of strobealign improves accuracy substantially. This finding can be used to further increase strobealign's accuracy. In summary, our evaluation uncovers linking biases and offers guidance on operator selection for randstrobe implementations.

Definitions
We use 0-indexed notation. We typically use S and T to denote strings, and we use the notation $S[i : j]$, $i < j$, to refer to a substring starting at position i and ending at (and including) the character at position j in S. We let the $|\cdot|$ operator denote the length of strings. Here, our alphabet consists of the letters (or nucleotides) $\Sigma = \{A, G, C, T\}$. We use $h(x) \rightarrow z$, where x and z are integers, to denote a hash function without specifying the underlying function. As for representation in memory, DNA strings of length at most 32 nucleotides (nt) can be stored in 64-bit integers by encoding A, C, G, and T as 00, 01, 10, and 11, respectively. Other letters, such as N for "unknown" nucleotide, are ignored. For k-mers longer than 32 nt, we represent them as structs of (concatenated) 64-bit integers. We use the variable x to represent the integer value of the encoding. Finally, we use & for bitwise AND, $\oplus$ for bitwise XOR, | for concatenation (e.g. concatenating two 64-bit integers into a 128-bit representation), and % for the modulo operator. We also use $B(x)$ to represent the function that returns the number of set bits in x.
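The 2-bit encoding and the set-bit count $B(x)$ described above can be illustrated with a minimal Python sketch. The function names `encode_kmer` and `popcount` are hypothetical, and returning `None` for k-mers that contain letters outside A, C, G, T is only one possible reading of "ignored"; the text does not specify how such letters are handled.

```python
# 2-bit codes as stated in the text: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a k-mer of at most 32 nt into one 64-bit integer, 2 bits per nucleotide.

    Returns None if the k-mer contains a letter other than A, C, G, T
    (an assumption of this sketch, not a rule taken from the paper).
    """
    if len(kmer) > 32:
        raise ValueError("k-mers longer than 32 nt need more than one 64-bit word")
    x = 0
    for base in kmer:
        code = CODE.get(base)
        if code is None:
            return None
        x = (x << 2) | code  # shift previous bases left, append the new 2-bit code
    return x


def popcount(x):
    """B(x): the number of set bits in the non-negative integer x."""
    return bin(x).count("1")
```

With this layout the first nucleotide ends up in the most significant used bits, so integer order of equal-length encodings follows lexicographic order under A < C < G < T.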

An overview of constructing strobemers
A k-mer is a substring of k nucleotides in a biological sequence S. Consequently, a k-mer only needs the length of the substring, k, as a parameter to be specified. A strobemer is a set of linked k-mers. Specifically, a strobemer consists of n l-mers $l_0, \ldots, l_{n-1}$, denoted strobes, where the first strobe $l_0$ has a determined position i in S. Downstream strobe $l_m$, $m \in [1, n-1]$, is selected in an interval $S[i + w_{min} + (m-1) w_{max} : i + m w_{max}]$ in S, and linked (appending the strobe to previous strobes) to the m previous strobes. Here, $w_{min}$ and $w_{max}$ specify the range of the sampling window. For example, strobe $l_1$ is sampled in $S[i + w_{min} : i + w_{max}]$ and linked to $l_0$. Since we consider 64-bit integer representations of the strobes in this study, we will from now on refer to the strobes as $x_0, x_1, \ldots, x_{n-1}$ and, when clear from context, we alternate x to mean either the strobe itself or its integer representation. This is also the reason we use the more general term linking instead of appending (strobes to the seed), as the linking method will vary with the strobe representation, as we discuss in detail in the next section.

Figure 1. Illustration of a desired random sampling of the second strobe for strobemers consisting of two strobes (case A). Whenever a pseudo-random method is used to select the downstream strobe based on the first strobe, it generates some sampling bias. Cases B-E show different biases we observed in the sampling. The metrics we propose to measure the bias are displayed under each of the illustrations of cases B-E.
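As a small illustration of the sampling windows, the interval for downstream strobe m can be computed directly from the first strobe position. The helper name `strobe_window` is hypothetical, and the sketch assumes that the interval bounds refer to candidate strobe start positions; handling of windows that extend past the end of S is left out.

```python
def strobe_window(i, m, w_min, w_max):
    """Inclusive (start, end) of the sampling window for strobe m >= 1.

    i is the position of the first strobe; the bounds follow the interval
    [i + w_min + (m - 1) * w_max, i + m * w_max] given in the text.
    """
    if m < 1:
        raise ValueError("only downstream strobes (m >= 1) have a sampling window")
    start = i + w_min + (m - 1) * w_max
    end = i + m * w_max
    return start, end
```

For the parameters used in the benchmarks of this study, (w_min, w_max) = (21, 100), the second strobe of a seed starting at position 0 is sampled in [21, 100] and a third strobe would be sampled in [121, 200].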
The methods to select strobes differ (Sahlin 2021a). For example, minstrobes have been used for long-read overlap detection (Firtina et al. 2023), and alternating strobe lengths have also been explored (Maier and Sahlin 2023). However, randstrobes were shown to be more sensitive for sequence matching than other methods using fixed strobe lengths (minstrobes and hybridstrobes) (Sahlin 2021a), are simpler to construct than methods with alternating strobe lengths (altstrobes and multistrobes) (Maier and Sahlin 2023), and are so far the most commonly implemented in practice (Sahlin 2022, Nip et al. 2023, Xu et al. 2023). Therefore, we will consider only the randstrobes method in this study. Randstrobes are parameterized by $(n, l, w_{min}, w_{max})$. The novelty compared to, e.g. k-mers and spaced k-mers is that strobemers allow flexibility in the strobes' spacing and can produce matches between two sequences in a region with insertions or deletions.

Strobemer construction: constraints and objectives
Let $M^{w_{max}}_{w_{min}}(x_i \mid x_{i-1}, \ldots, x_0)$, or simply M when context is clear, be a method to sample a strobe $x_i$ in a window given by its parametrization $(n, l, w_{min}, w_{max})$. We put the following constraints on M. 1) M selects $x_i$ based only on the sequence information of $x_{i-1}, \ldots, x_0$. 2) M is deterministic. That is, for two identical strings S and T, the same strobes are produced.
We want to find a method M such that 1) $H(M(x_i \mid x_{i-1}, \ldots, x_0))$ is maximized, where H denotes the entropy. Intuitively, M should sample $x_i$ as uniformly as possible within the window, regardless of previous strobes and the sequence in the window. 2) M constructs randstrobes as fast as possible.
The first constraint is essential to eliminate high-entropy but impractical solutions in sequence matching. For instance, using a (pseudo) random number generator (RNG) like rand() in C++ may seem to have good entropy. However, in scenarios involving similar strings S and T, where one has a deletion, the RNG is likely to generate different numbers upon encountering the deletion, making it unsuitable for string matching. Therefore, the method's decision should solely rely on the underlying sequence.
The first objective, instead, involves conditional entropy, which is challenging to measure. Merely assessing entropy by the uniformity of sampling sites within a sequence window is insufficient. For instance, if a method prefers selecting a strobe if it is identical to the previous strobe, and the distance between two identical strobes happens to be uniformly distributed across a sequence, the method may falsely appear to have perfect entropy. It is also worth noting that achieving high entropy is easier in randomly generated sequences, but the focus here is on repetitive regions common in biological sequences, where achieving sampling uniformity is more challenging.

Constructing randstrobes
The process of creating randstrobes can be separated into four modular components: 1) Hashing the strobes; 2) Linking the strobes; 3) Comparing the strobes during linking; 4) Construction of the final seed hash value.
We discuss each of the components below and suggest different methods to perform them.
Previously, $h_{NO}(x)$ was used in Sahlin (2021a) and $h_{TW}(x)$ was used in Sahlin (2022). This is the first study using $h_{XX}(x)$ and $h_{WY}(x)$ as hash functions to construct randstrobes. The hash functions xxHash and wyhash are general-purpose non-cryptographic pseudo-random hash functions that hash bytes into an integer range of size $2^b$ for some $b > 0$ (here, $b = 64$).

Linking strobes
The second strobe $x_1$ is linked to the first strobe $x_0$ by selecting the candidate strobe $x'_1$ in the window that minimizes or maximizes the link function $\ell$. For example, in the first strobemers study (Sahlin 2021a), two link functions were used. The first was a modulo-based link function (Sahlin 2021b). The second one was $\ell(x_0, x'_1) = (x_0 + x'_1) \,\&\, q$, where q is a bitmask with ones in the 16 least significant bits and zeros elsewhere [proposed as a faster alternative in the final publication (Sahlin 2021a)]. We call these functions $\ell_{MOD}$ and $\ell_{AND}$, respectively. Furthermore, two additional link functions were described in Sahlin (2022) and Xu et al. (2023) that we denote $\ell_{BC}$ and $\ell_{XOR}$. Here we propose three more alternatives: $\ell_{XV}$, $\ell_{CC}$, and $\ell_{MAMD}$. We provide formal definitions of all the link functions below.

$\ell_{MAMD}$: similar to $\ell_{MOD}$ but uses a BST (proposed in this study).

The $\ell_{MAMD}$ and $\ell_{MOD}$ link functions are theoretically nearly identical (see Supplementary Section 1). However, $\ell_{MAMD}$ uses a BST to lower the time complexity. Consider a window of hash values. Roughly stated, the $\ell_{MAMD}$ link function only needs four operations as we sweep the window over the sequence: find the minimum element (no modulo wrap-around), find the closest element to a specific value (modulo wrap-around), add the incoming element, and remove the outgoing element. These operations can all be performed in logarithmic time with a BST. The $\ell_{MAMD}$ link function is described in detail in Supplementary Section 1. We will discuss the computational complexity of all methods in Section 2.6. In this section, we only discussed linking the first two strobes. Linking additional strobes can be done recursively by applying the same link function between the previous resulting randstrobe hash value b and the next candidate downstream strobes $x_m$, $m \geq 2$, as $\ell(b, x_m)$.
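The four window operations listed above can be sketched in Python. This is an illustration only, not the method of Supplementary Section 1: it assumes a modulo link of the form (a + v) % p minimized over the window (analogous in shape to the AND-based link), where a and all window values are already reduced modulo p, and it uses a sorted list with the standard `bisect` module as a stand-in for a BST. The names `best_mod_candidate` and `slide` are hypothetical.

```python
import bisect


def best_mod_candidate(sorted_vals, a, p):
    """Return the value v in sorted_vals that minimizes (a + v) % p.

    sorted_vals: ascending, non-empty list of window hash values, each in [0, p).
    a: hash value of the previous strobe, in [0, p).
    """
    # Values v >= p - a wrap around and give (a + v) - p < a, which beats
    # every non-wrapping value (those give a + v >= a). Among wrapping
    # values the smallest one is best: the "closest element" query.
    idx = bisect.bisect_left(sorted_vals, p - a)
    if idx < len(sorted_vals):
        return sorted_vals[idx]
    # No value wraps around, so the plain minimum is best.
    return sorted_vals[0]


def slide(sorted_vals, outgoing, incoming):
    """Move the window one step: remove the outgoing value, add the incoming one."""
    del sorted_vals[bisect.bisect_left(sorted_vals, outgoing)]
    bisect.insort(sorted_vals, incoming)
```

Both queries are binary searches and therefore logarithmic in the window size. Insertion and deletion in a Python list are linear, so a balanced BST (or a comparable ordered container) would be needed to make all four operations logarithmic as described in the text.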

Sampling comparator
The comparator function, here denoted $c(\cdot)$, specifies the criteria by which we select strobe $x_1$ among candidates $x'_1$. To our knowledge, the only sampling comparator that has been proposed is $c_{min}(x_0, W) = \arg\min_{x'_1 \in W} \ell(x_0, x'_1)$ (Sahlin 2021a, 2022, Xu et al. 2023), where W is the collection of strobes in the window defined by $w_{min}$ and $w_{max}$. In this study, we propose $c_{max}(x_0, W) = \arg\max_{x'_1 \in W} \ell(x_0, x'_1)$. The comparator can influence the result for some hash and link constructions, as we will see in our benchmark.
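The effect of the comparator can be illustrated with a short Python sketch. The plain XOR link `link_xor` used here is a simplified stand-in chosen for illustration and is not necessarily identical to the XOR-based link functions evaluated in this study; `select_strobe` and the leftmost tie-break are likewise assumptions of the sketch.

```python
def link_xor(x0, x1):
    """Simplified XOR-based link value (illustrative stand-in)."""
    return x0 ^ x1


def select_strobe(x0, window, link=link_xor, comparator="min"):
    """Return the index in `window` of the strobe chosen by the comparator.

    window: list of candidate strobe hash values.
    comparator: "min" or "max" over the link values.
    Ties are broken by taking the leftmost candidate (an arbitrary choice).
    """
    scores = [link(x0, x) for x in window]
    target = min(scores) if comparator == "min" else max(scores)
    return scores.index(target)
```

For example, with x0 = 12 and window values [7, 12, 3], the link values are [11, 0, 15]: the min comparator selects the candidate identical to x0 (link value 0), whereas the max comparator selects the candidate with value 3.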

The final seed hash value
We have so far discussed only how to select strobes. However, once the strobes have been decided, we need to represent the randstrobe with a final hash value. The final hash value is what should be indexed and queried, e.g. in a seed-and-extend mapping framework. We denote the function to produce the final seed hash value as $f(x_0, \ldots, x_n)$. We need the function f to be as uncorrelated with the link function as possible. If we would use the hash value that comes out of $\ell(x_0, x_1)$ with, e.g. $c_{min}$, we are projecting hash values to the minimum value in each window. This leads to unnecessary hash collisions compared to a uniform hash function. Furthermore, as mentioned in Sahlin (2021a), it is important to avoid symmetric functions, i.e. $f(x_0, x_1) = f(x_1, x_0)$, when distinguishing inversions is important [although a symmetric function is used for forward and reverse-complement seeds in, e.g. read mapping (Sahlin 2022)]. Taking the above into consideration, we use an asymmetric function f; for two strobes, $f(x_0, x_1) = 2x_0 - x_1$. This formulation allows f not to have any apparent correlation with any of the benchmarked link functions, as we will see in Section 3.
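The two-strobe final hash value reported in the conclusions of this study, 2x_0 - x_1, can be sketched as follows. Reducing the result to 64 bits by wrap-around is an assumption of this sketch, as is the name `final_hash`.

```python
MASK64 = (1 << 64) - 1


def final_hash(x0, x1):
    """Asymmetric final seed hash for two strobes: 2*x0 - x1, wrapped to 64 bits."""
    return (2 * x0 - x1) & MASK64
```

Under 64-bit wrap-around this function is asymmetric for every pair of distinct 64-bit strobe values: f(a, b) = f(b, a) would require 3(a - b) to be 0 modulo 2^64, and since 3 is odd (hence invertible modulo 2^64) this only holds when a = b.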

Linking more than two strobes
Generally, to link $x_m$ to $x_0, \ldots, x_{m-1}$, $m \in [2, n-1]$, we use $\ell(b, x'_m)$, where $x'_m$ are the candidate strobes in the window, and b denotes a base value calculated from the previous m strobes. We set b equal to the previous strobes' final hash value, e.g. $b = f(x_0, x_1)$ and $\ell(b, x'_2)$ in the case of three strobes. This method can be applied recursively.
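The recursive linking can be put together in a compact Python sketch that returns the strobe positions of one seed. Several choices here are assumptions made for illustration: the XOR stand-in link, the two-strobe final hash 2x_0 - x_1 wrapped to 64 bits and reused recursively as the base value, treating window bounds as strobe start positions, and returning `None` when a window runs past the available hashes.

```python
MASK64 = (1 << 64) - 1


def final_hash(b, x):
    """Base-value update, reusing the two-strobe form 2*b - x (64-bit wrap)."""
    return (2 * b - x) & MASK64


def link_xor(b, x):
    """Simplified XOR link (illustrative stand-in)."""
    return b ^ x


def randstrobe_positions(hashes, i, n, w_min, w_max, link=link_xor, pick=max):
    """Positions of the n strobes of the seed whose first strobe starts at i.

    hashes[j] is the hash value of the l-mer starting at position j.
    pick is the comparator: the builtin min or max.
    """
    positions = [i]
    b = hashes[i]  # base value for linking the second strobe is x0 itself
    for m in range(1, n):
        start = i + w_min + (m - 1) * w_max
        end = i + m * w_max
        if end >= len(hashes):
            return None  # window not fully inside the sequence
        # Comparator applied to link values of all candidates in the window.
        j = pick(range(start, end + 1), key=lambda pos: link(b, hashes[pos]))
        positions.append(j)
        b = final_hash(b, hashes[j])  # b = f(previous strobes) for the next link
    return positions
```

For three strobes this computes b = f(x_0, x_1) before evaluating the link function on the candidates for x_2, matching the description above.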

Time complexity
Before discussing computational complexity, we make the following classifications of our link functions:
• Cheap computation: This group includes $\ell_{MOD}$, $\ell_{AND}$, $\ell_{BC}$, $\ell_{XOR}$, and $\ell_{MAMD}$. We denote them as computationally cheap because the hashing and linking can be separated. That is, we only need to calculate hash values once for each strobe, and the link function can be applied after.
• Expensive computation: This group includes $\ell_{CC}$ and $\ell_{XV}$. For these methods, we need to evaluate the hash value for the combination of $x_0$ and all its candidate downstream strobes, for each new $x_0$.
The time complexity of constructing randstrobes from a string of length $|S|$ varies with the link-function class. Let $t_h$ be the time complexity for the hash function, n the number of strobes, and $W = w_{max} - w_{min} + 1$ be the window size. Then, $|S| - n w_{max} - l + 1$ is the number of randstrobes constructed from S. We assume that the linking operators (i.e. +, &, $\oplus$, mod, |) can be performed in constant time, although the practical runtime varies among the operators, with $\oplus$ being cheaper to perform while | is relatively expensive.
Expensive computation methods perform $(1 + nW)$ hash calculations, and $nW$ other operations (such as +, &, $\oplus$, mod, |), per randstrobe. So the total complexity is $O((|S| - n w_{max} - l + 1)((1 + nW) t_h + nW))$. If we assume that $|S| \gg n w_{max} - l + 1$ and $t_h = \Omega(1)$ (i.e. the complexity of $t_h$ is at least a constant), we can simplify the expression of the time complexity of expensive computation methods and cheap computation methods to $O(|S| n W t_h)$ and $O(|S| t_h + |S| n W)$, respectively.
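The simplification can be spelled out as a short worked bound that uses only the per-randstrobe operation counts stated above. The symbol R is shorthand introduced here for the number of randstrobes, and the cheap-computation line assumes one hash calculation per position of S, following the classification of cheap methods.

```latex
\begin{align*}
R &= |S| - n w_{max} - l + 1 \;\le\; |S| \\
T_{expensive} &= R \big( (1 + nW)\, t_h + nW \big)
  \;\le\; |S| \big( 2nW\, t_h + nW \big)
  \;=\; O(|S|\, nW\, t_h)
  \qquad (nW \ge 1,\; t_h = \Omega(1)) \\
T_{cheap} &= |S|\, t_h + R\, nW \;=\; O(|S|\, t_h + |S|\, nW)
\end{align*}
```

The two classes therefore differ in whether the hash cost multiplies the window term nW or is paid only once per position.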
Lastly, the $\ell_{MAMD}$ link function is part of the cheap computation category. However, the time complexity is further reduced to $O(|S| t_h + |S| n \log W)$ through the logarithmic time complexity of searching for elements (see Supplementary Section 1 for details). While the BST implementation increases the constant coefficient through the BST overhead, we will see that the speed-up is substantial for large windows. We have abstracted over the exact time complexity of the hash functions. The cheapest computation is $h_{NO}$, which only streams over the sequence without performing hashing. Some hash functions also support streaming (Mohamadi et al. 2016) and can lower $t_h$.

Evaluation metrics
Different sampling biases can arise, as illustrated in Fig. 1. We were not able to find a single metric that captured all of these biases. Instead, we propose four suitable metrics that capture cases B-E in Fig. 1. A desirable result is that the selection of the second (or any downstream) strobe is performed as uniformly in the window and as independently of the previous seed as possible. Several seed-based applications also require fast construction; therefore, we also benchmark construction runtime.

Notation for evaluation metrics
Let N be the total number of seeds constructed from a string S, and M the number of seeds with distinct final seed hash values in S. We let i and j be index variables over the set of randstrobe seeds sorted by their first strobe position. Since we here sample one randstrobe per position in S, the index variables are equivalent to the start position of the seed, and the N seeds can be ordered with respect to the start position on S. We let $s_{ik}$ refer to the kth strobe in seed i and $p_{ik}$ to its position in S.

E-hits
The E-hits metric was introduced in Sahlin (2022). It provides a number between 1 and $|S|$, which is the expected number of times a seed occurs in the reference. The E-hits metric was used as a measure for expected seed repetitiveness in S when sampling reads uniformly at random from a reference string S, assuming S is much larger than the span of the seed (Sahlin 2022). We restate the E-hits metric here for self-containment. Let $i \in [1, M]$ be an index variable over the set of distinct seeds in S and $N > M$ be the total number of seeds in S (multiset). Let $x_i$ denote the number of times seed i occurs in S. Let $q_i = x_i / N$ be the probability of producing seed i when selecting a seed randomly from the N seeds. The E-hits metric is then the expected value over seed hits E[X] computed as

$$E[X] = \sum_{i=1}^{M} q_i x_i = \frac{1}{N} \sum_{i=1}^{M} x_i^2. \qquad (1)$$

In this study, seeds are represented as hash values. The above formula is equivalent if we replace the notion of a seed with the hash value representation of a seed. In this case, E-hits measures the expected number of identical hash values, which includes both repetitive seeds and non-desired hash collisions. We will measure the E-hits for the final seed hash values produced with f, and denote this quantity $E_f$. This is the same use of E-hits as in Sahlin (2022).
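The E-hits definition translates directly into a few lines of Python. This is a minimal sketch; the function name `e_hits` is hypothetical, and the input is any list of hashable seed representations such as final seed hash values.

```python
from collections import Counter


def e_hits(values):
    """Expected number of occurrences of a value drawn uniformly from `values`.

    Implements (1/N) * sum_i x_i^2, where x_i is the count of the i-th
    distinct value and N is the total number of values.
    """
    if not values:
        raise ValueError("E-hits is undefined for an empty collection")
    counts = Counter(values)
    return sum(c * c for c in counts.values()) / len(values)
```

A collection of all-distinct values gives 1.0 (the minimum), and N copies of a single value give N.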

E-hits of inter-strobe distance and strobe position
The idea and formulation of E-hits can be used to measure the repetitiveness of other quantities. To measure strobe-distance clumping (bias B) and periodicity clumping (bias D) in Fig. 1, we look at the distribution of inter-strobe distances within a randstrobe. Let $d_{jk}$ be the distance between the first strobe and the kth strobe in seed j. We let $x_i$ in Equation (1) be the number of times we observe distance $d_{jk}$. Equation (1) then measures the expected number of times we observe the distance $d_{jk}$ when randomly drawing a seed from S. We denote this quantity as $E_d$ and omit index variable k when it is clear from the context.
We measure second-strobe clumping (bias C) by computing the repetitiveness of the position of kth strobes in S. Let $x_i$ in Equation (1) represent the number of times we observe the kth strobe selected at position p in S. Then, the E-hits formula measures the expected number of times position p was sampled as the kth strobe when drawing a seed uniformly at random from S. We denote this quantity as $E_p$ (omitting index variable k when clear from context).
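Both quantities reuse the E-hits formula on different observations, which a short Python sketch makes explicit. The names `e_hits` and `distance_and_position_hits` are hypothetical, and seeds are assumed to be given as tuples of strobe positions (p_0, ..., p_{n-1}).

```python
from collections import Counter


def e_hits(values):
    """(1/N) * sum of squared counts over the distinct values."""
    counts = Counter(values)
    return sum(c * c for c in counts.values()) / len(values)


def distance_and_position_hits(seeds, k=1):
    """Return (E_d, E_p) for the kth strobe of each seed.

    seeds: non-empty list of tuples of strobe positions, one tuple per seed.
    E_d uses the distance between the first and the kth strobe;
    E_p uses the position of the kth strobe in the sequence.
    """
    e_d = e_hits([s[k] - s[0] for s in seeds])
    e_p = e_hits([s[k] for s in seeds])
    return e_d, e_p
```

For instance, four consecutive seeds (0, 25), (1, 25), (2, 25), (3, 40) have all-distinct distances (E_d = 1.0) but three of them share the same second-strobe position (E_p = 2.5), which is the second-strobe clumping pattern of bias C.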

The conflict metric
To study complex dependencies (termed other clumping; Case E) as depicted in Fig. 1, we introduce the conflict metric, which aims to measure the size of the overlaps of strobes from a set of neighboring randstrobes with start positions in [i, j], i < j. An overlap higher than what is expected under random sampling indicates selection bias. Let $o(i, j, k) = \max(0, l - |p_{jk} - p_{ik}|)$ measure the number of overlapping positions of the kth strobe between two randstrobes i and j. Then $\sum_{k=0}^{n-1} o(i, j, k)$ is the total number of overlapping positions between two randstrobes. The conflict metric for randstrobe i is then defined as

$$C_i = \max_{j \in [i+1,\, i+m]} \sum_{k=0}^{n-1} o(i, j, k).$$

In other words, $C_i$ is the largest observed overlap with any of the m consecutive downstream randstrobe seeds. We let the conflict metric (C) be the value of $C_i$ averaged over all seeds in S. The above formula does not take into account that strobes of different orders (k) between neighboring randstrobes might overlap. However, even if this is possible for some values of $w_{min}$, it does not originate from the bias that we want to measure, and can therefore be omitted.
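A minimal Python sketch of the conflict metric follows, with seeds given as tuples of strobe positions sorted by first strobe position. The names `overlap` and `conflict` are hypothetical, and skipping the last seed, which has no downstream neighbor, is an assumption of this sketch.

```python
def overlap(seed_i, seed_j, l):
    """Total overlap between same-order strobes of two seeds.

    Sums o(i, j, k) = max(0, l - |p_jk - p_ik|) over all strobe orders k.
    """
    return sum(max(0, l - abs(pj - pi)) for pi, pj in zip(seed_i, seed_j))


def conflict(seeds, l, m):
    """Average over seeds of the largest overlap with any of the next m seeds."""
    per_seed = []
    for i, seed in enumerate(seeds):
        downstream = seeds[i + 1 : i + 1 + m]
        if downstream:  # the final seed has nothing downstream and is skipped
            per_seed.append(max(overlap(seed, other, l) for other in downstream))
    return sum(per_seed) / len(per_seed)
```

As a usage example, for seeds [(0, 10), (1, 10), (2, 30)] with l = 3 and m = 1, the first pair overlaps in 2 + 3 = 5 positions and the second pair in 2 + 0 = 2 positions, so the metric evaluates to 3.5.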

Results
We evaluated all compatible combinations of $\ell$, c, and h. Some combinations, such as $h_{TW}$ with $\ell_{CC}$, are incompatible with strobes larger than 16 nucleotides (32 bits) because $h_{TW}$ is designed for 64-bit integers. We use a simulated highly repetitive sequence (SIM), a set of 20 Escherichia coli genomes (E20), and the CHM13 human chromosome Y from the T2T assembly (Nurk et al. 2022) (ChrY) to evaluate pseudo-randomness for randstrobes with $n = 2$. For runtime experiments, we used a simulated string of length 15 million. We also evaluated randstrobes with $n = 3$ on the SIM dataset. Details of the experiment design and rationale are found in Supplementary Section 2.

Pseudo-randomness
As for pseudo-randomness, we observed similar trends for the methods across the SIM, E20, and ChrY datasets. We also observed that the three hash functions ($h_{WY}$, $h_{TW}$, $h_{XX}$) had very similar results; we therefore focus here on presenting the data for the SIM dataset using only $h_{WY}$ compared to not hashing ($h_{NO}$). Results with all hash functions for SIM, E20, and ChrY are found in Supplementary Materials. Our benchmark highlights the following takeaways.
Hashing strobes: Always use a hash function to hash the strobes before linking (applicable to all link functions except $\ell_{CC}$), otherwise most link functions will be subject to some form of severe bias (Fig. 2 and Supplementary Figs S1-S3).
Comparator: Comparator choice is only important for some link functions. The cheap computation XOR-based methods $\ell_{XOR}$ and $\ell_{BC}$ exhibit high bias with the $c_{min}$ comparator. This is because the $c_{min}$ comparator will select a candidate strobe that is identical to the previous strobe if present in the window (XOR value of 0), while $c_{max}$ will have the opposite behavior. Since the repeats in the SIM dataset have reoccurring distances between them (which also happens in biological sequences), this causes distance clumping (bias B) and negative positional clumping (bias C).

Seed repetitiveness
Seed repetitiveness in the reference is crucial for applications such as read mapping (Sahlin 2022, Ekim et al. 2023, Maier and Sahlin 2023, Shaw and Yu 2023). We use k-mers of length 40 nt, corresponding to the same number of sampled positions in the randstrobes, as a reference method in this benchmark. The k-mers are stored as two strobes with the same final function as the randstrobes. We first verified that using our final hash function f for seed representation resulted in minimal hash collisions (Supplementary Fig. S4). Since hash collisions were not significant, we computed the E-hits of the final seed hash value ($E_f$) for all methods. As with randomness, it is important to use a hash function before linking strobes (Fig. 3 and Supplementary Figs S1-S3). Additionally, we observed that randstrobes generally have lower $E_f$ than k-mers for most hash and link functions, but repetitiveness can increase with specific combinations (Fig. 3).

Table 1 notes: (a) Results are described under the assumption that a hash function is used to hash the strobes (applicable to all link functions except $\ell_{CC}$). (b) Mentioned in Sahlin (2021a) but neither used nor studied. (c) Too much overhead to be used for small windows.

Runtime performance
Figure 4 shows the construction time for window sizes using $w_{max} = 100$ and $w_{max} = 1000$, respectively. Expensive computation methods ($\ell_{CC}$ and $\ell_{XV}$) perform a factor of nW more hash computations. However, they are only about 2.5-4 times slower than the average cheap computation methods when using $h_{WY}$ as hash function (Fig. 4). One explanation could be cache efficiency. We also observe that $\ell_{BC}$ and $\ell_{MOD}$ are substantially slower than other methods in the cheap-computation class. Finally, when constructing randstrobes with large windows, $\ell_{MAMD}$ is much faster than other methods (Fig. 4, lower panels). This is due to the BST implementation instead of a linear search across each window. However, due to its special updating technique utilizing arithmetic properties of the modulo operator, the method can only be used with the modulo link function. As for the hash functions, $h_{WY}$ performs better than $h_{XX}$ and $h_{TW}$ on our data for the expensive computation class, where strobes are represented by a struct of two 64-bit integer strobes.

Randstrobes in large windows
The $\ell_{MAMD}$ link function enables efficient construction of randstrobes in large windows. We were interested in the uniqueness of seeds that $\ell_{MAMD}$ produced compared to one of the best-performing methods, $\ell_{CC}$ (using $c_{max}$). We used $p = 100{,}001$ in the previous analysis. For this analysis, we set $p = 19{,}019{,}684{,}767{,}739{,}993$. The value of p needs to be significantly larger than the window size but smaller than the maximum hash value to guarantee high pseudo-randomness. To our knowledge, the value of p has no specific influence beyond that. We investigated the expected uniqueness (E-hits) of the seeds computed across chromosome Y of the CHM13 assembly (Fig. 5, left panel). In the figure, a window size of 0 corresponds to k-mers of size 256. We make two key observations about the uniqueness of seeds. First, we note that there is no substantial difference between the two link functions on chromosome Y from the CHM13 assembly, including telomere regions and many repetitive multigene families. Second, we observe that the E-hits value does not decrease monotonically with window size, as we initially expected. Minimum repetitiveness occurs at $w_{max} = 2{,}000$ instead of the largest evaluated window at $w_{max} = 10{,}000$. This is likely explained by the observation that, beyond a certain window size, it becomes more likely that the same pair of strobes is found and linked. We also looked at how the runtime scaled with window size. Figure 5 (right panel) shows the median runtime from 10 runs on the E. coli genome of 5.5 million nucleotides. Our BST implementation greatly outperforms $\ell_{CC}$.

Implementing c max in strobealign
Strobealign is a read mapper that uses randstrobes created from syncmers (Edgar 2021) using $c_{min}$ together with $\ell_{BC}$, a combination that we observed to perform particularly poorly in terms of seed uniqueness and randomness (Figs 2 and 3). Guided by our benchmark, we wanted to investigate whether $c_{max}$ would result in better mapping results. The experiment is described in detail in Supplementary Section 4. We did not observe a direct improvement in strobealign's accuracy when run with $c_{max}$ compared to the default version that uses $c_{min}$ (Supplementary Tables S1 and S2). However, we observed a large improvement in accuracy for the shorter read lengths when combining the results of the two runs of strobealign (details in Supplementary Section 4).

Discussion and conclusions
Constructing randstrobes involves four modular operations: computing individual strobe hash values (hash function), determining hash values for linked strobes (link function), selecting the final randstrobe from multiple candidates using a comparator function, and computing the hash value for the chosen randstrobe. The initial three operations (hash, link, and comparator functions) yield diverse results based on the combination of functions used. Our study introduced and benchmarked both novel and previously used hash, link, and comparator methods for randstrobe construction, accompanied by metrics to evaluate method biases. Our benchmark revealed biases in existing techniques and can offer general guidance for which methods to use when utilizing randstrobes as sequence comparison seeds. From our evaluation, we conclude the following.
• Hashing: Always hash the strobes before linking with a computationally cheap link method. It does not result in a large overhead in construction time (Fig. 4) while being beneficial for pseudo-randomness (Figs 2 and 3). The hash functions have roughly the same pseudo-randomness performance, but the $h_{WY}$ function had the best runtime. A downside with hashing compared to the 2-bit encoding is that nucleotide-level information of the seed is lost. This should be factored into the decision for the application at hand.
• Linking: In short, we believe $\ell_{CC}$ or $\ell_{XV}$ should be used when the highest pseudo-randomness is desired, $\ell_{XOR}$ (with $c_{max}$) should be used when speed is important, and $\ell_{MAMD}$ for use cases with very large windows (Table 1). We do not see any benefit in using $\ell_{AND}$ and $\ell_{MOD}$ over $\ell_{XOR}$. Finally, $\ell_{BC}$ is a special function designed for when biased sampling is desired, as in Sahlin (2022).
• Comparator: The comparator matters for some link functions (Table 1). For example, an XOR-based link function projects identical hash values to 0. Therefore, a min comparator will select identical strobes if present, while a max comparator will be inclined to select differing strobes. Consequently, in repetitive regions with occasional variations (e.g. the SIM dataset) where the window is larger than the repeat length, the min comparator will tend to collapse seeds while a max comparator has the opposite behavior. This, however, implies that in such regions, the max comparator will be less robust to sequencing errors in reads. These two effects pull in different directions when it comes to read mapping. We observed no substantial difference between them in strobealign (Supplementary Tables S1 and S2), but combining their results led to a large improvement for shorter reads (Supplementary Tables S1 and S2).
• Final seed hash value function: Choose a final seed hash value function that is uncorrelated to the link function to avoid hash collisions. For example, we used $2x_0 - x_1$ for two strobes, which did not show any apparent correlation with the link functions we benchmarked (Supplementary Fig. S4).

Future work
Efficiently applying hash and link functions can benefit cheap computation methods. A rolling hash function, like ntHash (Mohamadi et al. 2016), can enhance hash computation in these methods. This optimization proves valuable when hashing is relatively more expensive than linking, particularly for larger window sizes. Additionally, the link function $\ell_{MAMD}$ was designed using arithmetic reasoning to reduce construction time complexity. Further investigation is needed to determine if the rolling hash approach allows for arithmetic operations permitting efficient linking methods. We observed improved accuracy when combining results from min and max comparators in strobealign. Our proof-of-concept approach involved running strobealign twice and post-processing the alignments, resulting in slightly more than twice the runtime compared to a single run. To mitigate the increase in runtime, integrating seeds from both comparators into strobealign is a possible solution. This increases memory usage but may not affect runtime, since costly rescue-alignment calls may be reduced due to fewer regions without seed matches.

Figure 2. Results for metrics $E_d$ (upper panels), $E_p$ (middle panels), and C (lower panels) for randstrobes with parameter settings $(n = 2, l = 20, w_{min} = 21, w_{max} = 100)$ for the repetitive sequence dataset. The x-axis shows the different linking methods, and the min and max comparators are shown in left and right panels, respectively. We have normalized the values with a near ideal result produced by simulating strobes uniformly at random in the window with rand(). Therefore, a value of 1.0 indicates the best possible outcome (indicated by the black dashed line).

Figure 3. Normalized E-hits of seed hash values for various methods to construct randstrobes with parameters $(n = 2, l = 20, w_{min} = 21, w_{max} = 100)$ compared to k-mers of size 40. Lower value is better.

Figure 5. A comparison between $\ell_{MAMD}$ and $\ell_{CC}$ with parameters $(n = 2, l = 128, w_{min} = 129, w_{max} = x)$, where x is plotted on the x-axis. Left panel shows E-hits on Chromosome Y from the CHM13 human assembly (Nurk et al. 2022). The right panel shows median runtime out of 10 runs on an E. coli genome of 5.5 million nucleotides.

Table 1. Overview of link functions and comparator functions based on the results from our experiments (see table note a).